Data and R
Data Wrangling & DataViz

Session 2 | 19/08/2024
Henrique Costa | Métodos Estratégicos em FinQuant

Data and R

Functions

  • Recipes allow chefs to cook up tasty treats
    • Recipes call for ingredients
    • Recipes involve one or more steps
    • Steps transform ingredients into treats
  • Functions are like customizable recipes
    • Functions call for inputs (“arguments”)
    • Functions involve one or more lines of code
    • Code transforms inputs into outputs
    • Using functions requires parentheses (usually)

out <- f(in1, in2)

Functions Live Coding

# USECASE: Functions can perform a task more easily and readably

# TEMPLATE: output <- function_name(input)

9 ^ (1 / 2)

x <- sqrt(9)
x

# ==============================================================================

# LESSON: We can also use functions to transform objects

y <- 9

sqrt(y)

# ==============================================================================

# LESSON: We can even use functions to transform the result of calculations

2 / 3

round(2 / 3)

# ==============================================================================

# LESSON: We can customize what a function does using arguments

# TEMPLATE: output <- function_name(argument, argument_name = argument_value)

round(2 / 3, digits = 2)

round(2 / 3, digits = 3)

# ==============================================================================

# LESSON: Some arguments are optional because they have default values

round(2 / 3) # the default value for digits is 0

round(2 / 3, digits = 0)
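
To make the recipe analogy concrete, here is a minimal sketch of a user-defined function with a default argument. The name to_percent and its digits default are illustrative assumptions and do not come from the lecture code; round() is the same base R function used above.

```r
# Hypothetical helper (not part of the lecture): converts a rate to a percentage.
# `digits` has a default value, so callers may leave it out.
to_percent <- function(rate, digits = 1) {
  round(rate * 100, digits = digits)
}

default_digits <- to_percent(2 / 3)              # uses the default digits = 1
zero_digits    <- to_percent(2 / 3, digits = 0)  # overrides the default by name
three_digits   <- to_percent(2 / 3, 3)           # arguments can also be matched by position

default_digits
zero_digits
three_digits
```

Naming the argument (digits = 0) is usually easier to read than relying on position, especially when a function has many optional arguments.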

Vectors

  • Vectors combine similar objects into a collection
    • I like to imagine a train pulling multiple cars
    • A vector is one object with many sub-objects
    • We refer to each sub-object as an element
  • Some functions transform each element in turn
    • Double the amount of cargo in every train car
  • Some functions summarize across elements
    • Calculate the total cargo across all train cars

v <- c(1, 2, 3)

Vectors Live Coding

# LESSON: We can combine multiple elements into a vector

# TEMPLATE: vector_name <- c(element1, element2, element3)

x <- 4 9 16 25 # error

x <- c(4, 9, 16, 25)
x

y <- c(2, 3)
y

# ==============================================================================

# LESSON: We can also combine multiple vectors and elements

c(x, y)

c(x, y, 20)

# ==============================================================================

# USECASE: Math operators will transform each element individually

x + 1

x * 3

x # but again, this won't be saved unless you use assignment

# ==============================================================================

# USECASE: Some functions will also transform each element individually

sqrt(x)

log(x)

# ==============================================================================

# USECASE: Other functions will summarize the vector with a single number

length(x)

sum(x)

mean(x)
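
The train analogy from the slide can be sketched directly in base R. The cargo values below are made up for illustration; the point is only the difference between an element-wise transformation (one result per car) and a summary (one number for the whole train).

```r
# Hypothetical cargo (in tons) carried by three train cars
cargo <- c(10, 25, 40)

# Element-wise: double the cargo in every car (the result has the same length)
doubled <- cargo * 2
doubled

# Summary: total cargo across all cars (the result has length 1)
total <- sum(cargo)
total
```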

Strings

  • When talking to R, we need a way to distinguish
    • Object/function names (e.g., the mean function)
    • Text/character data (e.g., the word mean)
  • Strings are R’s way of storing text data
    • Strings can store any characters (no rules!)
    • Strings are created and displayed with quotes
  • R has great tools for working with strings
    • Strings can be collected into vectors
    • Special functions can transform strings

name <- "John Doe"

Strings Live Coding

# USECASE: Strings are the main way to store character data in R
 
my_color <- red # error

my_color <- "red" # correct

# ==============================================================================

# USECASE: Strings can also store symbols not allowed in object names

dye <- "red#40"
dye

dyes <- c("red#40", "blue#02")
dyes

# ==============================================================================

# PITFALL: Many operations you can do to numbers won't work for strings

dyes + 1 # error

mean(dyes) # error

# ==============================================================================

# USECASE: But other operations work for both or even just for strings

length(dyes)

nchar(dyes)

dyes2 <- toupper(dyes)
dyes2
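
As a small illustration of why quotes matter, the sketch below contrasts the word "mean" (text) with the function mean (an object R can run). The variable name word is an assumption for the example; class() and nchar() are base R functions.

```r
word <- "mean"        # with quotes: a string, just text

class(word)           # the string has class "character"
class(mean)           # without quotes, mean refers to a function

n_letters <- nchar(word)          # string functions work on the text itself
average   <- mean(c(1, 2, 3))     # the function works on numbers

n_letters
average
```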

Packages

  • Cookbooks are a great way to learn to cook
    • They contain lots of recipes and instructions
    • Browse an online bookstore for a cookbook
    • Order it to add it to your personal bookshelf
    • To use, pull the cookbook off the shelf
  • Packages are like cookbooks for R
    • They contain helpful functions and datasets
    • Browse an online repository for a package
    • Install it to add it to your personal library
    • To use, load the package from the library

library("pkg_name")

Packages Live Coding

# USECASE: The stringr package adds a function to fix capitalization

students <- c("mary anne", "BENjamin", "Lee")

# ==============================================================================

# PITFALL: But we can't use that function without installing the package

str_to_title(students) # error

# ==============================================================================

# LESSON: Installing a package using RStudio

# - RStudio > Extras pane > Packages tab > Install button

# ==============================================================================

# PITFALL: We also need to load the package before we can use it

str_to_title(students) # error

# ==============================================================================

# LESSON: We load the package using library()

library("stringr")
str_to_title(students) # finally works!

# ==============================================================================

# LESSON: We can also keep our packages updated using RStudio

# RStudio > Extras pane > Packages tab > Update button
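
The same install-then-load workflow can also be done from code rather than through the RStudio buttons. This is a sketch that assumes an internet connection and a configured CRAN mirror if stringr is not yet installed; install.packages(), requireNamespace(), and library() are base R functions.

```r
# Install stringr only if it is not already in the personal library
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}

# Load the package from the library, then use its functions
library("stringr")

students <- c("mary anne", "BENjamin", "Lee")
fixed <- str_to_title(students)
fixed
```

Installing is needed once per computer, whereas loading with library() is needed once per R session.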

Wrangle I

Tidy Data Principles

  • There are many ways to store data
  • We will be learning the tidy data format
    • Data should be rectangular
    • Each variable has its own column
    • Each observation has its own row
    • Each value has its own cell

Other Data Advice

  • Name all variables in the first row
    • This is called a header row
  • Avoid merged cells for data storage
    • These are okay for communication
  • Avoid empty cells whenever possible
    • Mark missing data as NA
  • Avoid formatting-as-data for storage
    • e.g., non-redundant color-coding

Tidying Example 1

Not Tidy

Name Ann Bob Cat Dom
Age 13 10 11 11
Weight 56.4 46.8 41.3 43.3

❌ Here, each row is a variable and each column is an observation.

Tidy

Name Age Weight
Ann 13 56.4
Bob 10 46.8
Cat 11 41.3
Dom 11 43.3

✔️ Here, each column is a variable and each row is an observation.

Tidying Example 2

Not Tidy

Names: Ann Bob Cat Dom
Age Weight
13 56.4
10 46.8
11 41.3
11 43.3

❌ Here, we have data that is not rectangular because the Names variable has its own row.

Tidy

Name Age Weight
Ann 13 56.4
Bob 10 46.8
Cat 11 41.3
Dom 11 43.3

✔️ Here, we have made the data rectangular by moving the Names variable to its own column.

Tidying Example 3

Not Tidy

country year cases / population
Afghanistan 1999 NA / 19987071
2000 2666 / 20595360
Brazil 1999 37737 / 172006362
2000 80488 / 174504898
China 1999 212258 / 1272915272
2000 213766 / 1280428583

❌ Here, we have merged cells and two values stored in a single cell.

Tidy

country year cases population
Afghanistan 1999 NA 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

✔️ Here, we have un-merged the countries and separated the cases and populations variables into columns.

Tidying Example 4

Not Tidy

student grade
Amber 91.5 A-
Bristol 86.2 B
Charlene 94.0 A
Diego 89.3 B+
Legend: Psych. Major, Psych. Minor

❌ Here, we have a missing variable name and formatting-as-data.

Tidy

student psych grade letter
Amber major 91.5 A-
Bristol minor 86.2 B
Charlene major 94.0 A
Diego NA 89.3 B+

✔️ Here, we have added a column for the psych variable, removed the legend, and named the letter variable.

Tidying Example 5

Not Tidy

student grade letter
Amber 91.5 A-
Bristol* 94.2 A
Class Summary
As 2 Yay!
Bs 0
*Grade was revised.

❌ Here, we have two types of data in one file and a footnote as data.

Tidy

student grade letter revised
Amber 91.5 A- FALSE
Bristol 94.2 A TRUE

letter count notes
A 2 Yay!
B 0

✔️ Here, we have split the data into two separate tables and added the revised and notes variables.

Long vs. Wide Format

Wide Format

date Boeing Amazon Google
2009-01-01 $173.55 $174.90 $174.34
2009-01-02 $172.61 $171.42 $170.04

✔️ Here, we have a wide format where each observation is a date.

Long Format

date stock price
2009-01-01 Boeing $173.55
2009-01-01 Amazon $174.90
2009-01-01 Google $174.34
2009-01-02 Boeing $172.61
2009-01-02 Amazon $171.42
2009-01-02 Google $170.04

✔️ Here, we have a long format where each observation is the combination of a date and a stock.
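
Converting between the two formats can be sketched with pivot_longer() and pivot_wider() from the tidyr package (loaded with tidyverse). These functions are not covered in the slides here, so treat this as an optional illustration; the prices are the ones shown in the tables, stored as plain numbers without the dollar sign.

```r
library(tidyverse)

# Wide format: one row per date, one column per stock
wide <- tibble(
  date   = c("2009-01-01", "2009-01-02"),
  Boeing = c(173.55, 172.61),
  Amazon = c(174.90, 171.42),
  Google = c(174.34, 170.04)
)

# Long format: one row per combination of date and stock
long <- pivot_longer(
  wide,
  cols = -date,          # pivot every column except date
  names_to = "stock",    # old column names go here
  values_to = "price"    # old cell values go here
)
long

# And back again: one column per stock
back_to_wide <- pivot_wider(long, names_from = stock, values_from = price)
back_to_wide
```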

Tibbles

  • R works particularly well with tidy data
  • We store tidy data in data frames or tibbles
    • Tibbles are just fancier data frames
      (i.e., they have a few extra features)
  • To use tibbles, we need the tidyverse package
  • Tibbles are constructed from one or more vectors
    • The vectors must have the same length
    • They can contain different types of data

Vectors

We start with three separate vector objects that all have the same length.

We set it up so that the nth car in each train corresponds to the same observation.

Tibble

Then we combine the vectors into a single tibble (or data frame) object.

Now, as the tibble moves around, the variables always stay together.

Tibbles Live Coding

# SETUP: Install and load the tidyverse package

# Extras pane > Packages tab > Install

library(tidyverse)

# ==============================================================================

# LESSON: Create a tibble from vectors

x <- c(10, 20, 30, 40)
x

y <- x * 2 - 4
y

my_tibble <- tibble(x, y)
my_tibble

# ==============================================================================

# USECASE: You can mix different types of vectors in a single tibble

first_names <- c("Adam", "Billy", "Caitlyn", "Debra")

age_years <- c(12, 13, 10, NA)

guests <- tibble(first_names, age_years)
guests

# ==============================================================================

# TIP: To save time, you can also create the vectors in the tibble call

gradebook <- tibble(
  grade = c(95, 83, 90, 76),
  letter = c("a", "b", "a-", "c")
)
gradebook

# ==============================================================================

# PITFALL: Don't try to combine vectors with different lengths

y <- c(1, 2, 3)
x <- c("a", "b")

tibble(y, x) # error

# ==============================================================================

# LESSON: The one exception is that R will "recycle" a single value to fill the column

tibble(y, x = "a")

# ==============================================================================

# LESSON: You can "extract" a vector from a tibble using $

mytibble <- tibble(x = c(1, 2, 3, 4, 5), y = "test")

mytibble$x

mytibble$y

# ==============================================================================

# PITFALL: Don't try to extract a vector that doesn't exist

mytibble$z # warning: unknown column, returns NULL

Importing and Exporting

  • Data is usually stored in data files
    • Importing files into R is called reading
    • Exporting files from R is called writing
  • A convenient data file type is a CSV
    • This stands for comma-separated values
    • A CSV file is easy to share with other people
  • The tidyverse package can read/write CSVs
    • Other packages can read/write other types (e.g., readxl, haven, rio, googlesheets4)

Read/Write Live Coding

# SETUP: Load the tidyverse package (if you haven't yet)

library(tidyverse)

# ==============================================================================

# USECASE: Create a tibble and write it to a file

gradebook <- tibble(
  id = c(123, 456, 789),
  grade = c("A", "B", "A")
)
gradebook

write_csv(gradebook, file = "gradebook.csv")

# NOTE: You can see the new file in Extras pane > Files tab.
# You can open the file in another program (e.g., Microsoft Excel).
# You can also email this file to someone else to share it.

# ==============================================================================

# PITFALL: Don't swap the order of the tibble and the file

write_csv("gradebook.csv", gradebook) # error

# ==============================================================================

# USECASE: Read in a file containing data

old_gradebook <- read_csv("gradebook.csv")
old_gradebook

# NOTE: read_csv() will examine and guess the data type of each variable.
# You can tell it the data type of each variable, but that is more advanced.

# ==============================================================================

# PITFALL: Don't use the read.csv() and write.csv() functions

old_gradebook <- read.csv("gradebook.csv") # not a tibble
old_gradebook
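
The note above mentions that you can tell read_csv() the data type of each variable. Here is a minimal sketch of that, using the col_types argument with cols() and col_character() from readr (loaded with tidyverse). Writing to a temporary file via tempfile() is a choice made for this example so that nothing is left in your project folder.

```r
library(tidyverse)

gradebook <- tibble(
  id = c(123, 456, 789),
  grade = c("A", "B", "A")
)

# Write to a temporary file
path <- tempfile(fileext = ".csv")
write_csv(gradebook, file = path)

# Guessing: id is read back as a number
guessed <- read_csv(path)

# Explicit: ask for id (and grade) to be read as text
typed <- read_csv(
  path,
  col_types = cols(id = col_character(), grade = col_character())
)
typed
```

Reading ID numbers as text can be useful because IDs are labels rather than quantities, so arithmetic on them is rarely meaningful.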

Wrangle II

Basic wrangling verbs

  • tidyverse provides tools for wrangling tibbles
    • These functions are named after verbs
    • So if you name your objects after nouns
    • …your code becomes easier to read

Noun(noun) ❌         Verb(noun) ✔️
blender(fruit)        blend(fruit)
screwdriver(screw)    drive(screw)
boxcutter(box)        cut(box)

Column-focused verbs

  • Select retains only certain columns/variables
    • select(TBL, VAR_KEEP, -VAR_DROP)
  • Mutate adds or transforms columns/variables
    • mutate(TBL, NEW_VAR = OLD_VAR / 1000)
  • Rename changes the names of columns/variables
    • rename(TBL, NEW_NAME = OLD_NAME)
  • Relocate changes the order of columns/variables
    • relocate(TBL, VAR_MOVE, .after = OTHER_VAR)

Select Live Coding

# SETUP: Load package and inspect example tibble

library(tidyverse) # includes the dplyr package
starwars

# ==============================================================================

# USECASE: Retain only the specified variables

sw <- select(starwars, name)
sw
sw <- select(starwars, name, sex, species)
sw

# ==============================================================================

# PITFALL: Don't forget to save the change with assignment

select(starwars, name, sex, species)
starwars # still includes all variables

# ==============================================================================

# USECASE: Retain all variables between two variables

sw <- select(starwars, name, hair_color:eye_color)
sw

# ==============================================================================

# USECASE: Retain all variables except the specified ones

sw <- select(starwars, -sex, -species)
sw
sw <- select(starwars, -c(sex, species))
sw
sw <- select(starwars, -c(hair_color:starships))
sw

Mutate Live Coding

# SETUP: Create example tibble

patients <- tibble(
  id = c("S1", "S2", "S3"),
  feet = c(6, 5, 5),
  inches = c(1, 7, 2),
  pounds = c(176.3, 124.9, 162.6)
)
patients

# ==============================================================================

# USECASE: Add one or more variables

p2 <- mutate(patients, sex = c("M", "F", "F"))
p2

ages <- c(32, 41, 29)
p2 <- mutate(patients, ages = ages)
p2

p2 <- mutate(
  patients, 
  sex = c("M", "F", "F"), 
  ages = ages
)
p2

# ==============================================================================

# USECASE: Compute variables

p2 <- mutate(patients, height = feet + inches / 12)
p2

p2 <- mutate(patients, ln_pounds = log(pounds))
p2

# ==============================================================================

# USECASE: Overwrite variables

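patients <- mutate(patients, height = feet + inches / 12) # create height (in feet) first
patients <- mutate(patients, height = height / 3.281) # overwrite: feet to meters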
patients

# ==============================================================================

# USECASE: Duplicate variables

p2 <- mutate(patients, weight = pounds)
p2 # both weight and pounds exist

Rename / Relocate Live Coding

# USECASE: Change the name of one or more variables

starwars

sw <- rename(starwars, Character = name)
sw

sw <- rename(starwars, height_cm = height, mass_kg = mass)
sw

# ==============================================================================

# PITFALL: Don't swap the order and try old_name = new_name

sw <- rename(starwars, name = Character) # error

# ==============================================================================

# USECASE: Move variables before or after another variable

starwars

sw <- relocate(starwars, species, sex, .before = height)
sw

sw <- relocate(starwars, species, sex, .after = name)
sw

# ==============================================================================

# PITFALL: Don't forget the period!

sw <- relocate(starwars, sex, before = height) 
sw # height was accidentally renamed to before

Row-focused verbs

  • Arrange sorts rows based on their values
    • arrange(TBL, VAR_SORT_UP)
    • arrange(TBL, desc(VAR_SORT_DOWN))
    • arrange(TBL, VAR_SORT_1ST, VAR_SORT_2ND)
  • Filter retains certain rows based on criteria
    • filter(TBL, DBL_CRIT > 0)
    • filter(TBL, STR_CRIT == "A")
    • filter(TBL, CRIT1, CRIT2)

Arrange Live Coding

# USECASE: Sort observations by a variable

starwars

sw <- arrange(starwars, height)
sw # sorted by height, ascending

sw <- arrange(starwars, name)
sw # sorted by name, alphabetically

# ==============================================================================

# USECASE: Sort observations by a variable, in reverse order

sw <- arrange(starwars, desc(height))
sw # sorted by height, descending

sw <- arrange(starwars, desc(name))
sw # sorted by name, reverse-alphabetically

# ==============================================================================

# USECASE: Sort observations by multiple variables

sw <- arrange(starwars, hair_color, mass)
sw # sorted by hair_color, then ties broken by mass

Filter Live Coding

# USECASE: Retain only observations that meet a criterion

sw <- filter(starwars, mass > 100)
sw # only observations with mass greater than 100

sw <- filter(starwars, mass <= 100)
sw # only observations with mass less than or equal to 100

sw <- filter(starwars, species == "Human")
sw # only observations with species equal to Human

sw <- filter(starwars, species != "Human")
sw # only observations with species not equal to Human

# ==============================================================================

# PITFALL: Don't try to use a single = for testing equality

sw <- filter(starwars, height = 150) # error

sw <- filter(starwars, height == 150) # correct
sw 

# ==============================================================================

# PITFALL: Don't forget that R is case-sensitive

sw <- filter(starwars, species == "human")
sw # no observations left (because it should be Human)

# ==============================================================================

# USECASE: Retain only observations that meet complex criteria

sw <- filter(starwars, mass > 100 & height > 200)
sw # only observations with mass over 100 AND height over 200

sw <- filter(starwars, height < 100 | hair_color == "none")
sw # only observations with height under 100 OR hair_color equal to none

# ==============================================================================

# PITFALL: Don't forget to complete both conditions

sw <- filter(starwars, mass > 100 & < 200) # error

sw <- filter(starwars, mass > 100 & mass < 200) # correct
sw

# ==============================================================================

# PITFALL: Don't use == to compare a variable against a vector of options

sw <- filter(starwars, species == c("Human", "Droid")) # wrong: compares element by element, so rows are missed

sw <- filter(starwars, species %in% c("Human", "Droid")) # correct
sw
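
To see why == and %in% behave differently with a vector of options, it can help to try them on a tiny vector outside of filter(). The values below are made up for illustration. When the lengths happen to line up, == raises no error at all and simply returns misleading answers.

```r
species <- c("Human", "Droid", "Droid", "Human")
wanted  <- c("Human", "Droid")

# == compares position by position, reusing `wanted` as
# Human, Droid, Human, Droid to match the length of `species`
by_equals <- species == wanted
by_equals   # TRUE TRUE FALSE FALSE

# %in% asks, for each element, "is it found anywhere in `wanted`?"
by_in <- species %in% wanted
by_in       # TRUE TRUE TRUE TRUE
```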

Filter Cheatsheet

Symbol  Description              Num  Chr
<       Less than                Yes  No
<=      Less than or equal to    Yes  No
>       More than                Yes  No
>=      More than or equal to    Yes  No
==      Equal to                 Yes  Yes
!=      Not equal to             Yes  Yes
%in%    Found in                 Yes  Yes
&       Logical And              Yes  Yes
|       Logical Or               Yes  Yes

Wrangle III

Pipes & Pipelines

  • How can we do multiple operations to an object?
    1. x <- 10
    2. x2 <- sqrt(x)
    3. x3 <- round(x2)
  • This works but is cumbersome and error-prone
  • A better approach is to use pipes and pipelines
    • x3 <- 10 |> sqrt() |> round()
  • I like to read |> as “and then…”
    • “Take 10 and then sqrt() and then round()”

Pipes Live Coding

# SETUP: Enable the pipe operator shortcut

# Tools > Global Options... > Code tab > Check "Use Native Pipe Operator"

# Type out |> or press Ctrl+Shift+M (Windows) / Cmd+Shift+M (Mac)

# ==============================================================================

# LESSON: The pipe pushes objects to a function as its first argument

# TEMPLATE: x |> function_name() is the same as function_name(x)

x <- 10

y <- sqrt(x)
y

y <- x |> sqrt()
y

# ==============================================================================

# PITFALL: Don't forget to remove the object from the function call

x |> sqrt(x) # wrong

x |> sqrt() # correct

# ==============================================================================

# USECASE: You can still use arguments when piping

z <- round(3.14, digits = 1)
z

z <- 3.14 |> round(digits = 1)
z

# ==============================================================================

# USECASE: Pipes are useful with tibbles and wrangling verbs

starwars

sw <- select(starwars, name, species, height)
sw

sw <- starwars |> select(name, species, height)
sw

# ==============================================================================

# PITFALL: Don't add a pipe without a step after it

sw <- starwars |> select(name, species, height) |> # error

Pipelines Live Coding

# USECASE: You can chain multiple pipes together to make a pipeline

x <- 10 |> sqrt() |> round()
x

# ==============================================================================

# TIP: If you want to see the output of a pipeline, you can pipe to print()

x <- 10 |> sqrt() |> round() |> print()

# ==============================================================================

# TIP: To make your pipelines more readable, move each step to a new line

x <- 
  10 |> 
  sqrt() |> 
  round() |>
  print()

# ==============================================================================

# PITFALL: Don't put the pipe at the beginning of a line, though

x <- 
  10 
  |> sqrt()
  |> round()
  |> print() # error

# ==============================================================================

# USECASE: Chain together a series of verbs to flexibly wrangle data

tallones <- 
  starwars |> 
  select(name, species, height) |> 
  rename(height_cm = height) |> 
  mutate(height_ft = height_cm / 30.48) |>  
  filter(height_ft > 7) |> 
  arrange(desc(height_ft)) |>  
  print()
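
As a quick check that a pipeline is only a different way of writing nested function calls, the sketch below compares both forms. It assumes R 4.1 or later, where the native pipe |> is available; the object names are chosen for this example.

```r
x <- 10

# Nested calls are read inside-out: sqrt() runs first, then round()
nested <- round(sqrt(x))

# The pipeline is read left-to-right: "take x and then sqrt() and then round()"
piped <- x |> sqrt() |> round()

identical(nested, piped)   # TRUE

# Extra arguments stay inside the parentheses of their own step
nested_2 <- round(sqrt(x), digits = 2)
piped_2  <- x |> sqrt() |> round(digits = 2)

identical(nested_2, piped_2)   # TRUE
```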

Factors

  • Factors are used to represent categorical data
    • Factors have multiple possible levels
    • Levels are discrete and mutually-exclusive
  • Sometimes categories are unordered (nominal)
    • Action or Comedy or Drama
    • Asia or Europe or North America
  • Sometimes categories are ordered (ordinal)
    • Mild < Medium < Hot
    • XS < S < M < L < XL

Factors Live Coding

# USECASE: Ask 10 kids to order 1: nuggets, 2: pizza, or 3: salad

food <- c(2, 2, 1, 2, 1, 2, 1, 1, 2, 2)
food

# ==============================================================================

# LESSON: We can turn this vector into a factor with factor()

food2 <- factor(food, levels = c(1, 2, 3))
food2

food3 <- factor(food, levels = c(1, 2, 3),
                labels = c("nuggets", "pizza", "salad"))
food3

# ==============================================================================

# USECASE: We can also quickly and easily count each level with table()

table(food3)

# ==============================================================================

# PITFALL: Don't confuse levels and labels

food4 <- factor(food, labels = c(1, 2, 3),
                levels = c("nuggets", "pizza", "salad"))
food4 # full of <NA> because it can't find these levels

# ==============================================================================

# USECASE: You can also just enter strings directly (as self-labels)

genre <- c("pop", "metal", "pop", "rock", "rap", "rap", "pop", "rock")
genre

genre2 <- factor(genre) # observed levels will be assigned alphabetically
genre2

table(genre2)

# ==============================================================================

# LESSON: If ordinal, enter levels low-to-high and add ordered = TRUE

salsa <- c("hot", "mild", "medium", "mild", "medium", "medium")

salsa2 <- factor(salsa, 
                 levels = c("mild", "medium", "hot"), 
                 ordered = TRUE)
salsa2 

# NOTE: We may want to visualize or model ordinal factors differently

# ==============================================================================

# USECASE: Working with factors in a tibble

cereal <- read_csv("cereal.csv")
cereal

cereal2 <- mutate(cereal, mfr = factor(mfr), type = factor(type))
cereal2

table(cereal2$mfr)

table(cereal2$type)
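
Two practical consequences of using factors can be sketched with the food and salsa vectors from above: declared levels are counted even when nobody chose them, and ordered factors can be compared with < and >. Only base R functions are used; the object names counts and at_least_medium are chosen for this example.

```r
food <- c(2, 2, 1, 2, 1, 2, 1, 1, 2, 2)
food3 <- factor(food, levels = c(1, 2, 3),
                labels = c("nuggets", "pizza", "salad"))

# Nobody ordered salad, but the level still appears with a count of 0
counts <- table(food3)
counts

salsa <- c("hot", "mild", "medium", "mild", "medium", "medium")
salsa2 <- factor(salsa,
                 levels = c("mild", "medium", "hot"),
                 ordered = TRUE)

# Ordered factors understand comparisons between levels
at_least_medium <- salsa2 >= "medium"
sum(at_least_medium)   # how many are medium or hotter

max(salsa2)            # the highest observed level
```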

Missing Values

  • Sometimes your data will have missing values
    • Perhaps these were never collected
    • Perhaps the values were lost/corrupted
    • Perhaps the participant didn’t respond
  • We need to tell R which values are missing
    • To do so, we set those values to NA
    • Functions from tidyverse make this easy
  • Missingness is often “contagious” in R
    • e.g., a vector with NA has an unknown mean

Missing Values Live Coding

# SETUP: We will need tidyverse for the read and mutate functions

library(tidyverse)

# ==============================================================================

# PITFALL: Number codes for missingness will mess up calculations in R

heights <- c(149, 158, -999) # here we use -999 to represent a missing value

range(heights)

mean(heights)

log(heights) # our missing value is no longer -999

# ==============================================================================

# USECASE: Use NA for missingness instead

heights2 <- c(149, 158, NA)
heights2

log(heights2) # the NA stayed an NA (due to contagiousness)

# ==============================================================================

# LESSON: Use na.rm = TRUE to do a summary function ignoring the NAs

mean(heights2) # the mean is an NA (due to contagiousness)

mean(heights2, na.rm = TRUE)

range(heights2, na.rm = TRUE)

# ==============================================================================

# USECASE: Dealing with missing values in tibbles

cereal <- read_csv("cereal.csv")

cereal$rating

range(cereal$rating)

# ==============================================================================

# LESSON: Use na_if() to convert specific values to NA while mutating

cereal2 <- mutate(cereal, rating = na_if(rating, -999))

cereal2$rating

range(cereal2$rating, na.rm = TRUE)

# ==============================================================================

# LESSON: Use the na argument of read_csv() to convert specific values to NA while reading

cereal3 <- read_csv("cereal.csv", na = "-999")

cereal3$rating

range(cereal3$rating, na.rm = TRUE)
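
Because missingness is contagious, even the test "is this value equal to NA?" returns NA. The sketch below shows this and uses is.na(), the base R function intended for finding missing values. The object names are chosen for this example.

```r
heights2 <- c(149, 158, NA)

# Comparing with NA gives NA for every element, not TRUE or FALSE
heights2 == NA

# is.na() is the reliable way to ask which values are missing
missing <- is.na(heights2)
missing

# Count the missing values, then keep only the observed ones
n_missing <- sum(missing)
observed <- heights2[!missing]

n_missing
mean(observed)   # same result as mean(heights2, na.rm = TRUE)
```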

Wrangle IV

Summarize

  • Although we store data about many observations…
  • …we often want to summarize across observations
    • This is like folding the tibble down to one row
  • We’ve seen functions that summarize vectors
    • length(), sum(), min(), max()
    • mean(), median(), sd(), var()
  • summarize() lets us use them on tibbles
    • It works very similarly to mutate()
    • It always creates a tibble as output

Summarize Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

sales <- 
  tibble(
    customer = c(1, 2, 3, 1, 3),
    store = c("A", "A", "A", "B", "B"),
    items = c(25, 20, 16, 10, 5),
    spent = c(685, 590, 392, 185, 123)
  ) |> 
  print()

# ==============================================================================

# USECASE: Summarize the typical sales

my_summary <- 
  sales |> 
  summarize(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print()

# ==============================================================================

# PITFALL: Don't use summary() instead of summarize()

my_summary <- 
  sales |> 
  summary(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print() # not a tibble

# ==============================================================================

# USECASE: Use more than one summary function

my_summary <- 
  sales |> 
  summarize(
    total_items = sum(items),
    total_spent = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print()

# ==============================================================================

# USECASE: Use counting functions

my_counts <- 
  sales |> 
  summarize(
    n_sales = n(),
    n_customers = n_distinct(customer),
    n_stores = n_distinct(store)
  ) |> 
  print()

Group Summarize

  • We can also summarize a tibble by group
    • This is like folding the tibble multiple times
    • Specifically, we fold down to one row per group
  • The syntax for summarize is identical
    • The only difference is to the tibble
    • We first pass it through group_by()
    • Pipelines make this very easy

Group Summarize Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

sales <- 
  tibble(
    customer = c(1, 2, 3, 1, 3),
    store = c("A", "A", "A", "B", "B"),
    items = c(25, 20, 16, 10, 5),
    spent = c(685, 590, 392, 185, 123)
  ) |> 
  print()

# ==============================================================================

# LESSON: We pass a tibble through group_by to group it

sales

sales |> group_by(store) # note the display says "grouped"

# ==============================================================================

# USECASE: We can then summarize and get stats per group

sales |> 
  group_by(store) |> 
  summarize(
    customers = n_distinct(customer),
    items_sold = sum(items),
    total_sales = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  )

# ==============================================================================

# SETUP: Let's get a larger, more realistic dataset

# Extras pane > Packages tab > Install > nycflights13

library("nycflights13")

flights

# ==============================================================================

# USECASE: Find the carrier with the lowest average delays

flights |> 
  group_by(carrier) |> 
  summarize(m_delay = mean(dep_delay, na.rm = TRUE)) |> 
  arrange(m_delay)

# ==============================================================================

# LESSON: We can also group by multiple variables

# USECASE: Let's find the day of the year with the most flights

flights |> 
  group_by(month, day) |> 
  summarize(n_flights = n()) |> 
  arrange(desc(n_flights))
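
When grouping by more than one variable, summarize() keeps part of the grouping on its result and prints a message about it. One way to get a plain, ungrouped tibble is the .groups = "drop" argument of summarize(). The sketch below shows this on the small sales tibble, together with count(), a dplyr shortcut for group_by() followed by summarize(n = n()). The object names per_pair and visits are chosen for this example.

```r
library(tidyverse)

sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
)

# One row per store-customer combination, with the grouping removed afterwards
per_pair <- sales |>
  group_by(store, customer) |>
  summarize(
    n_sales = n(),
    total_spent = sum(spent),
    .groups = "drop"
  )
per_pair

# Shortcut for counting rows per group
visits <- sales |> count(store)
visits
```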

Visualize I

What is a graphic?

A data visualization expresses data through visual aesthetics.

Describing Graphics

Some simple graphics are easy to describe and may even have ready names.

Describing Graphics

A grammar of graphics will help us describe more complex graphics.

The Grammar of Graphics

  • The grammar of graphics is a set of rules for describing and creating data visualizations
  • To make our data visual (and therefore put our highly evolved occipital lobes to work)…
    • We connect variables to visual qualities
    • We represent observations as visual objects
  • This requires some fundamental elements
    • We will first learn about them in lecture
    • We will then apply them in R using {ggplot2}

Data

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

Graphics require data (e.g., tibbles), which describe observations using variables.

Aesthetic Mappings

Graphics require aesthetic mappings, which connect data variables to visual qualities.

Scales

Graphics require scales, which connect specific data values to specific aesthetic values.

Geometric Objects

Graphics require geometric objects (geoms), which represent the observations.

ggplot2 Basics

  • The ggplot2 package is a part of tidyverse
    • No need to install or load it separately
    • It plays nicely with tibbles and wrangling
  • It implements the grammar of graphics in R
    • The “gg” stands for “grammar of graphics”
    • Thus, we will need to provide all four elements
  • We will create a pseudo-pipeline of commands
    • However, we will use + rather than |>
    • This is because {ggplot2} predates the R pipe
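
As a rough illustration of the last bullet, the sketch below pipes a wrangled tibble into `ggplot()` with `|>` and then switches to `+` for the plotting steps. It assumes {dplyr} and {ggplot2} are installed; the filter on `year` is only an example choice.

```r
library(dplyr)
library(ggplot2)

p <- mpg |>
  filter(year == 2008) |>            # wrangling steps are chained with |>
  ggplot(aes(x = displ, y = hwy)) +  # from ggplot() onward, steps are added with +
  geom_point()
p
```

Mixing the two operators is a common source of errors: using `|>` after `ggplot()` will not add a layer.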

ggplot2 Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# LESSON: First, set the data to a tibble

p <- ggplot(data = mpg)
p

# ==============================================================================

# LESSON: Next, set the aesthetic mappings with aes()

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
p

# ==============================================================================

# TIP: You can leave off the optional argument names

p <- ggplot(mpg, aes(x = displ, y = hwy))
p

# ==============================================================================

# LESSON: Next, set the positional scales

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)", 
    limits = c(1, 7), 
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  )
p

# ==============================================================================

# LESSON: Finally, add a point geom

p <- 
  ggplot(mpg, aes(x = displ, y = hwy)) + 
  scale_x_continuous(
    name = "Engine Size (in liters)", 
    limits = c(1, 7), 
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  ) +
  geom_point()
p

# ==============================================================================

# TIP: If you leave off the scales, ggplot2 will pick default scales for you

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p

# ==============================================================================

# LESSON: We can also customize the geom with arguments

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(color = "red", shape = "square", size = 2)
p

Basic Layering

  • ggplot2 uses a layered grammar of graphics
    • We can keep stacking geoms on top
  • Layering adds a lot of possibilities
    • We can convey more complex ideas
    • We can learn more about our data
  • But we can still describe these graphics
    • Just describe each layer in turn
    • And describe the layers’ ordering
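
One way to make "describe each layer in turn" concrete is to look at the data behind each layer. The sketch below is an illustration only, assuming {ggplot2} is installed; it uses `layer_data()`, which returns the computed data for the layer at position `i`, numbered in the order the layers were added.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +                               # layer 1: drawn first (bottom)
  geom_smooth(method = "lm", formula = y ~ x)  # layer 2: drawn on top

points_data <- layer_data(p, i = 1)  # one row per observation in mpg
smooth_data <- layer_data(p, i = 2)  # the fitted line, evaluated along x

nrow(points_data)
nrow(smooth_data)
```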

Basic Layering Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Add a smooth geom (i.e., line of best fit)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

# ==============================================================================

# USECASE: Add a line geom (i.e., connecting points)

economics

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_point()

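# NOTE: Line thickness is set with linewidth (size is deprecated for lines since ggplot2 3.4.0)
# Layers are drawn in the order they are added, so the last geom sits on top

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point() +
  geom_line(color = "orange", linewidth = 1) # line drawn over the points

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "orange", linewidth = 1) +
  geom_point() # points drawn over the line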

# ==============================================================================

# USECASE: Add reference line geoms

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_hline(yintercept = 0, color = "orange", linewidth = 1) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point()

# The x-axis holds dates, so give xintercept as a Date (this date is only an example)
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_vline(xintercept = as.Date("2008-01-01"), color = "orange", linewidth = 1) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point()

# Dates are counted in days, so slope = 0.5 means 0.5 units of y per day
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_abline(intercept = 4000, slope = 0.5, color = "orange", linewidth = 1) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point()
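
Reference lines are often more useful when their positions come from the data. The following is a minimal sketch under that assumption, using only {ggplot2}; the names `mean_unemploy` and `peak_date` are made up for this example, and the choice of mean and peak is illustrative.

```r
library(ggplot2)

# Reference values computed from the data (illustrative choices)
mean_unemploy <- mean(economics$unemploy)
peak_date <- economics$date[which.max(economics$unemploy)]

p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  geom_hline(yintercept = mean_unemploy, linetype = "dashed") + # average level
  geom_vline(xintercept = peak_date, color = "orange")          # date of the peak
p
```

Because `peak_date` is a Date, it lines up with the date axis without any conversion.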

Working with Color

  • Color scales come in two main types:
    • Discrete scales have separate colors
      • Best with factor variables
    • Continuous scales form a gradient
      • Best with numeric variables
  • There are two ways to control color:
    • You can map color to a variable
      • It will take on different values
    • You can set color to a value
      • It will take on one value only

Color Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Continuous color scales work well with numeric variables

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4)

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4) +
  scale_color_continuous(type = "viridis")

# ==============================================================================

# USECASE: Use a discrete color scale with categorical variables

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(
    name = "Drivetrain", 
    breaks = c("4", "f", "r"), 
    labels = c("Four Wheel", "Front Wheel", "Rear Wheel")
  )

# ==============================================================================

# PITFALL: Don't forget to set categorical variables as factors

ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) + 
  geom_point() # R guesses you want a continuous scale

ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) + 
  geom_point() + 
  scale_color_discrete(name = "Cylinders")

# ==============================================================================

# LESSON: Set a geom's color aesthetic to make it always that color

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

# ==============================================================================

# PITFALL: However, do this inside the geom function, not inside aes()

ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point() # unintended: "blue" is treated as a data value, not as a color

# ==============================================================================

# LESSON: If you both set and map color, the setting will win

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point(color = "blue") 
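
When the default discrete palette is not what you want, one option is to choose the colors yourself with `scale_color_manual()`. This is a sketch rather than a recommendation: the vector name `drv_colors` and the three color choices are assumptions made for illustration.

```r
library(ggplot2)

# Named vector: names are the data values of drv, values are the colors to use
drv_colors <- c("4" = "darkorange", "f" = "steelblue", "r" = "forestgreen")

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_manual(
    name = "Drivetrain",
    values = drv_colors,
    breaks = c("4", "f", "r"),
    labels = c("Four Wheel", "Front Wheel", "Rear Wheel")
  )
p
```

Naming the vector ties each color to a data value, so the pairing does not depend on the order of the levels.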

Themes

  • Themes control how non-data elements look
    • e.g., how thick to draw the gridlines
    • e.g., where to position the legend
  • Complete themes change many elements at once
    • Some are built into ggplot2
    • Others come in R packages
    • {papaja} provides theme_apa()
  • Individual elements can be customized too

Themes Live Coding

# SETUP: We will need tidyverse and an example graphic

library(tidyverse)

p <- 
  ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point() +
  labs(title = "Fuel Efficiency")
p

# ==============================================================================

# USECASE: Apply a "complete" theme

p + theme_bw()

p + theme_classic()

p + theme_dark()

# ==============================================================================

# LESSON: For more precise control, we can use theme()

p + theme(legend.position = "top")

p + theme(plot.title = element_text(color = "purple", face = "bold"))

p + theme(panel.grid = element_blank())

# NOTE: There are a lot of elements to learn, so use a cheatsheet!
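
A complete theme and individual tweaks can also be combined and stored in one object for reuse across plots. The sketch below assumes {ggplot2} is installed; `theme_course` is a hypothetical name, and the specific settings are only examples.

```r
library(ggplot2)

# Start from a complete theme, then override a few individual elements
theme_course <- theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold")
  )

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Fuel Efficiency") +
  theme_course # add the stored theme like any other step
p
```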

Exporting Graphics

  • We may need to export graphics from R
    • e.g., for a paper, poster, or presentation
  • This job is handled fantastically by ggsave()
    • We can create many types of files
    • We can customize the exact size
  • I recommend .png for most daily purposes
    • For publishing, I prefer .pdf or .svg
    • They retain perfect quality at any zoom
    • You can send these files to most publishers

Exporting Live Coding

# SETUP: We will need tidyverse and an example graphic

library(tidyverse)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth() +
  labs(x = "Engine Displacement", y = "Highway MPG")
p

# ==============================================================================

# USECASE: Save a specific ggplot object to a file

ggsave(filename = "pfinal.png", plot = p)

# ==============================================================================

# LESSON: Specify the size of the file to create

ggsave(filename = "pfinal2.png", plot = p, 
       width = 6, height = 3, units = "in")

# ==============================================================================

# LESSON: Just change the extension to create a different file type

ggsave(filename = "pfinal2.pdf", plot = p, 
       width = 6, height = 3, units = "in")

# ==============================================================================

# PITFALL: Creating a very large image may make the text look small

ggsave(filename = "p_poster.png", plot = p, 
       width = 12, height = 8, units = "in")

# ==============================================================================

# TIP: You can quickly increase the text size using base_size

p2 <- p + theme_grey(base_size = 24)

ggsave(filename = "p_poster2.png", plot = p2,
       width = 12, height = 8, units = "in")
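
For raster formats such as .png, the pixel dimensions depend on the resolution as well as the width and height. The sketch below is an illustration under a few assumptions: it writes to a temporary folder so nothing is left in the working directory, and the file name `p_dpi_demo.png` is made up. It uses the `dpi` argument of `ggsave()`, whose default is 300.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output path inside the session's temporary folder
out_file <- file.path(tempdir(), "p_dpi_demo.png")

# 6 x 3 inches at 300 dots per inch gives an image of 1800 x 900 pixels
ggsave(filename = out_file, plot = p,
       width = 6, height = 3, units = "in", dpi = 300)

file.exists(out_file)
```

Lowering `dpi` makes a smaller file for screens; raising it suits print, at the cost of file size.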